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We propose a general learning algorithm for solving optimization problems, based on a simple 
strategy of trial and adaptation. The algorithm maintains a probability distribution of possible 
solutions (configurations), which is updated continuously in the learning process. As the probability 
distribution evolves, better and better solutions are shown to emerge. The performance of the 
algorithm is illustrated by the application to the problem of finding the ground state of the Ising 
spin glass. A simple theoretical understanding of the algorithm is also presented. 
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Many problems in science and engineering can be formulated in terms of optimization problems. In these problem 
we often want to find a set of optimum values for a set of variables that minimizes a given function of the variables. 
When the number of variables is large, especially in cases where there are a number of local minima, an exact solution 
is usually impossible to obtain; the aim is thus to find near-optimal solutions. 

A few optimization methods, based on ideas from physics and biology, have been developed recently which lead to 
rather good general purpose algorithms |^ for solving optimization problems. They have been successfully applied 
to a wide range of practical problems. One of them is a stochastic algorithm known as optimization by simulated 
annealing (OSA) The method is based on an analogy with thermodynamics, in particular, on the way in which the 
atoms in a liquid find their minimum energy configuration of a crystal, when the liquid is annealed or cooled slowly. 
The algorithm consists of Monte Carlo dynamics performed at a sequence of effective temperatures to simulate the 
effect of annealing. The stochastic dynamics allows access to a larger region of configuration (solution) space than 
simple "quenching" methods, and thereby helps in reaching a good solution. However, the Monte Carlo search is based 
on evolution of a single starting configuration and is thus often confined to a limited region of configuration space and 
is consequently not efficient in searching through configuration space. Many ideas have been proposed to make OSA 
efficient. For example, the OSA based on the multi-canonical sampling technique has proved quite effective in the 
spin glass problem [Q , Q and the traveling salesman problem . Another general purpose algorithm which has been 
used extensively is the Genetic Algorithm (GA) The Genetic Algorithm demonstrates the importance of keeping 
many configurations ("species") in the optimization process. The algorithm mimics the principles of evolution (using 
crossover, mutation, etc. to update the configurations). GAs have been applied to a large range of problems in a 
wide variety of topics. To apply GAs to large-size problems, however, a significant number of configurations need to 
be retained (thus imposing memory requirements) and in addition, many independent runs may be needed to avoid 
missing out on good solutions. Thus GAs may not be efficient for optimizing large-size problems. 

In this paper, we propose a learning algorithm, which is as general as the genetic algorithm, but does not require 
storing many configurations explicitly. What is kept and updated in our learning algorithm are probability weights 
associated with each spin that enable one to generate new configurations probabilistically, with lower energy con- 
figuartions favored; this allows many configurations to be kept implicitly. The learning process is constructed in such 
way that the evolution of the probability distribution will lead to better and better configurations (solutions). An 
early version of the algorithm has been applied to solve the traveling salesman problem 0. We first present the 
general algorithm and demonstrate analytically that in its simplest version the algorithm leads to progressively more 
optimal solutions. In the second part of the paper we present explicit results for the problem of determining the 
ground state of the Ising spin glass. 

To describe the algorithm, let us consider the general optimization problem of finding the values of a set of variables 
or parameters {A^, i = 1,2, ...n} such that the function F(Ai, A2, A„) is a minimum; F represents a (free) energy 
or an objective function. As is done in GAs, we encode the variables A^ using binary digits, and write the function as 
F{cri, U2, cfn), where {uk, k = 1,2,..., N} are binary numbers assuming values and 1. Equivalently, we may use 
spin variables Si with values —1 and 1. Since we will test our algorithm on the spin glass problem, we will employ the 
spin description from now on. 

The aim of the optimization is to find the spin configuration {si} that gives rise to the smallest value of F. The 
learning algorithm for this problem can be formulated as follows. For each spin Si, a weight Wi is assigned. The 
probability for choosing Sj = 1 is defined as Wi / {1 -\- Wi) and the probability for choosing = — 1 is then 1/(1 -I- Wi) 
(Initially Wi is set to be 1, so that the probabilities of choosing Si — 1 and Si ~ —1 are equal). The basic ingredients 
in the algorithm are trial and adaptation: first select a configuration {si}; then modify the weights {w, } to favor 
configurations with smaller values of F. 

Selection of Spin Configuration: A configuration is selected by choosing the initial configuration {si} with the 
probability determined by the weights {wi}. In the simplest version of the algorithm, this configuration is used in the 
evaluation for updating the weights. The performance of the algorithm can be greatly improved by implementing a 
local optimization on the configuration (changing Si from 1 to —1 and vice versa). The resulting configuration after 
the local optimization will then be used for the updating (evaluation) of the weights. The simplest way to perform the 
local optimization, which is quite general, consists of individual spin flips to lower the value of F. More sophisticated 
algorithms can also be used. But they are likely to be problem dependent. 

Evaluation of the configurations obtained and modification of weights: Starting from the second trial, the configu- 
ration obtained in the current trial is compared with the configuration obtained in the previous trial. Let {si, S2, ■■■} 
denote the current configuration and {si,S2,..} denote the previous configuration; let the corresponding values of 
the function to be minimized be F and F' respectively. The comparison of these two configurations leads to the 
modification of weights described by the equation: 

^new ^ ^old^-MF-F')(s,~s',)/2^ ; ^ ^^^^ 

where N is the total number of spins in the systems, and a represents the modification rate of weights (learning rate 
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of the algorithm). According to this rule, if the functional value of the current configuration, F, is lower (higher) than 
the previous one, F' , the weight will be modified to favor (disfavor) the current spin configuration. The new weights 
will be used in the selection of the next configuration according to the prescription given earlier and the procedure 
repeated. As learning advances, the spin configuration will be gradually "frozen" into a near-optimal configuration. 

We study the simplest version of the model analytically to understand how the learning process leads to better 
solutions. We consider the limit where the learning rate a is very small, and consequently Wi changes very slowly. In 
the spirit of the "adiabatic approximation" we can write down an equation describing the change of the weight as a 
function of time (defined as the number of trials used): 



dwi 



Sl,...SN si ...... 



Wi 



(2) 



where P(si, s^r) is the probability of generating the configuration {si, Sjv}- Since each spin is chosen indepen- 
dently, the probability P can be written as 

p(si,s2,...,sjv) =n^(*»)' 



where p(sj) is the probability that the ith spin in the configuration assumes the value Sj. As we have described earlier, 
p{si) is given by Wi/{1 + Wi) if Sj = 1 and 1/(1 + Wi) if Si = —1. These can be combined into a single expression as 
follows: 

_ (1 + Si)wi/2 + (1 - Si)/2 _l SiWi-l 

P\Si) — ... — „ + 



1 + Wi 



2 2 Wi + 1 



In the limit a ^ we can expand the exponential function in the above equation and keep only the leading order 
term. We have 



dWi 
Hi 
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Straightforward manipulations lead to 
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Given the expression for we can then evaluate the change in the average functional value as a function of time. 
The average is defined as 

F=Y,F{{si})P{si,...,SN), 

and ^ is given by 



dF ^ ^ 1 dp{s^) 



(5) 



Now using the fact that 



1 dp{si) Si 1 dwi 1 Wi — \.dwi 

p{si) dt p{si) (1 + Wi)"^ dt 2wi * Wi + 1 dt 



we obtain 



dF a X - 



E P{si,...,SN)F{si,...,SN){Si 



I LSl,...SN 



Wj-l 
Wi + 1 



(6) 



3 



It can be seen from the above equation that F in the learning process is gradually reduced < 0), and thus better 
and better configurations are obtained as learning advances. The reduction is diminished when the learning process 
converges, i.e., when the probability of = 1 approaches one {wi approaches oo) or zero {wi approaches 0); in these 
limits Si — and ^ — 0- This result provides some theoretical basis for our learning algorithm, which is quite 

valuable as analytical understanding of optimization algorithms is typically difficult to obtain. 

We illustrate the performance of the learning algorithm in the case of the Ising spin glass systems described by the 
Edwards- Anderson Hamiltonian 

E = - J.jSiSj, (7) 

where the sum includes only the nearest neighbors (4 for two-dimensional systems and 6 for three-dimensional systems) ; 
the exchange interactions Jy , between the spins Si = ±1, are independent quenched random variables which assume 
the values ±1 with equal probability. Clearly the optimization method can be used for other distributions of J^. The 
aim of the optimization is to find the spin configuration {si} with the lowest value of E for a given set of , or the 
total energy per site e (defined as E divided by the total number of spins in the system). The problem at hand then 
is to optimize e(si, S2, sn). 

Our studies on the spin glass problems were done using a 200MHz SGI Power Challenge and a cluster of SGI Indigo 
workstations. The CPU times quoted in this paper have all been converted to the equivalent CPU times of the SGI 
Power Challenge. We have studied two-dimensional systems of size ranging from 5 x 5 up to 200 x 200 and three 
dimensional systems of size 4 x 4 x 4 up to 25 x 25 x 25. Let us first look at the performance of the simplest version of 
the algorithm without local optimization. Here we focus on a sample of size 20 x 20. The learning rate is taken to be 
a = 0.1. Figure 1 shows the energy obtained vs. the CPU time t spent (the data are taken every 800 iterations). It 
also shows the lowest energy obtained up to time t. These data clearly show the convergence of the learning algorithm 
as described in our analysis above. The simple version of the algorithm is certainly not efficient as should be expected, 
but the result obtained is still quite impressive in view of the simplicity of the algorithm (it reduces the energy from 
around to about —1.2). Better results can be obtained using a smaller learning rate a, but this will be more time 
consuming. 

The simplest version of the algorithm can be improved dramatically even with a simple local optimization where 
single spin flips are attempted to lower the energy after the configuration is selected. In this local optimization we 
make a few passes through the lattice and a spin is flipped if the flipping leads to a lower energy configuration; the 
procedure stops when the conflguration can not be improved further by flipping any individual spin on the lattice. The 
locally optimized conflguration is then used for comparison with the previous conflguration (also locally optimized). 
It should be mentioned that this simple local optimization technique does not utilize any special feature of the spin 
glass problem; it can thus also be easily applied to other optimization problems. Figure 2 shows the lowest energy 
obtained vs. the CPU time spent using this technique (the same 20 x 20 sample is used). For comparison we use 
four different learning rates: a = 0.1, a = 0.5, a = 2.5, and a = 12.5. It is clear that with the use of the single spin 
flips, the algorithm is made much more efficient. The dependence of the algorithm on the learning rate is also clearly 
illustrated in the graph. With large a, the learning process converges quickly, but the result obtained is worse than 
the result obtained using a slower learning process. There is a tradeoff between obtaining a quick solution or a good 
solution, which is controlled entirely by the choice of the appropriate value of a in the algorithm. 

In addition to the simple local optimization based on single spin flips, we can incorporate more sophisticated 
local optimization techniques in the algorithm. The use of sophisticated local optimization techniques is likely to be 
problem dependent. For the spin glass problem, we use a local optimization technique similar to the Kernighan-Lin 
variable-depth search algorithm used in the graph-partioning problem and the traveling salesman problem [p^ . 
The idea is to replace the search for one favorable spin flip by a search for a favorable sequence of spin flips, using the 
energy of system to guide the search. A sequence of spin flips in the variable-depth search is obtained sequentially 
as follows. We start with the spin flip at a selected location in the system and search its neighbors to flnd the most 
favorable spin flip as the next spin flip in the sequence (the most favorable spin flip is one that gives rise to the lowest 
energy among all spin flips considered). In general, after the kth spin flip the neighbors of all the spins which have 
been flipped will be searched to flnd the most favorable spin flip as the {k + l)st spin flip of the sequence (the flipped 
spins in the flrst k spin flips of the sequence will not be considered again). 

Let AE(k) denote the total accumulated change in energy from the energy of the starting conflguration due to the 
k spin flips. To cut short the search that most likely will not lead to a better conflguration, we apply the stopping 
condition: the search will stop at the kth step if AE{k+ 1) is greater than 2d — 2 {d is the spatial dimension). We also 
set a maximum number of spin flips allowed, denoted by n. So the search will stop when k = n. After the search is 
completed, we choose fco corresponding to the minimum AE. If AE{ko) < 0, then we move the starting conflguration 
to the energetically better conflguration corresponding to fc = fco (by adopting the flrst fco spin flips); otherwise we 
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abandon all the spin flips generated in the search and keep the original configuration. A new search will be initiated 
starting from a new location in the system. 

The implementation of the local optimization based on the variable-depth search is accomplished by keeping a 
linked list of starting locations for the search. This is initialized to include all lattice sites at the beginning. The 
starting location of the search is taken and removed from the top of the list. If a better configuration is obtained in 
the search, the locations of all flipped spins together with their neighbors will be appended to the list (if they are not 
already in the list). The local optimization will end, if the linked list is empty. 

Fig. 3 shows the optimization results based on n = (no local optimization), n = 1 (single spin flips are used), 
n = 9 and n — 100 (variable-depth search). The same 20 x 20 sample was used. The same learning rate a = 2.5 is 
used in all the optimizations. It is clear that better local optimization leads to better overall results for the learning 
algorithm. For n — 9 and n = 100 the results based on local optimization alone are already quite close the optimal 
value (due to smaller system size), so the improvement due to the learning process is small. However for the larger 
systems the improvement due to the learning process can be signiflcant. One advantage of our learning algorithm is 
that any local optimization technique can be incorporated easily into the algorithm. The performance of the local 
optimization is not sensitive to the choice of n as long as n is not much less than 10. In fact many searches are stopped 
after the first few steps because of the stopping condition we employ. The time taken to perform a local optimization 
with n ~ 100 is typically 10 to 15 times longer than the time required for a local optimization with only single spin 
fiips. 

The overall performance of the algorithm is summarized in Table I (for 2D systems) and Table II (for 3D systems) . 
The number of samples we use for each system size ranges from 320 for the smallest system to 10 for the largest system. 
The average energy (per site) we obtained and the average time taken to reach the lowest energy configuration are 
listed, together with the average number of iterations used. In obtaining the results presented in the tables, local 
optimization with n = 100 (for 2D systems) and n = 125 (for 3D systems) was used in the learning algorithms. To 
compare with the results obtained in the literature, we take our best results obtained with a = 0.1, and fit our data 
using the form cl — e^o + cL^'^ to obtain the energy of the infinite system. The results are Coo = —1.4028 ± 0.0019 
for 2D systems and Coo = — 1.7857 ± 0.0026 for 3D systems. These are consistent with the best results quoted in the 
literature jl^- In particular, for two-dimensional systems, Simone et al. [|l^ use an exact algorithm based on 
the branch and cut technique to find the ground states of spin glass systems with system size up to 50 x 50. They 
obtain the extrapolated result Coo = —1.4022 ± 0.0003 using the same form of the fitting function. It is not clear, 
however, whether their technique can be efficiently implemented for 3D systems or not (Finding the ground state of 
the 3D spin glass is an NP-complete problem). For 3D systems, the most efficient algorithm appears to be the one 
using a hybrid of Genetic Algorithm and local optimization. Pal used the hybrid algorithm to study 3D systems 
of sizes up to 14 X 14 X 14 and he obtained Boo = —1.785708 ± 0.000075 based on the same form of the fitting function 
we used. Our 3D result agrees well with his result. 

Our algorithm is quite fast compared with most other algorithms. For example, for 14 x 14 x 14 systems, our 
algorithm needs, on average, 1270 seconds with a = 0.5 to obtain the average energy of —1.7865. In comparison, the 
optimization using the hybrid GA implemented by Pal, which itself is much faster than the original genetic algorithm 
approach used by Sutton et al., takes, on average, 23540 seconds per run (30 runs were used) on a 134 MHz SGI Indy 
computer (which is about four times slower the computer we use), to get the same average energy of —1.7865. If we 
do not need especially high accuracy, we can choose a larger learning rate and obtain the result much more quickly. 
With a = 12.5 we can obtain the average energy of —1.7836 in the average CPU time of 53.6 seconds. We can study 
systems of size up to 200 x 200 and 25 x 25 x 25 with a — 12.5 rather easily without sacrificing much accuracy. 

In conclusion, we have demonstrated how a complex optimization problem can be solved by a simple learning 
strategy of trial and adaptation. The learning process is somewhat similar to the evolution process in Genetic 
Algorithms. As in GA this algorithm has the advantage that it is based on global searches in configuration space. 
But instead of keeping many configurations explicitly as in GAs we use probability weights to generate configurations 
probabilistically. Thus many configurations are implicitly kept for effective mutation and crossover through the 
updating of the probability weights. For the simple version of the algorithm without local optimization, we have 
shown analytically that the algorithm does lead to better and better solutions. Local optimization techniques can 
also be easily incorporated in the algorithm, which leads to a rather effective optimization method, as we have 
demonstrated in the spin glass problem. We believe that our learning algorithm can be equally effective in other 
optimization problems, in particular the ones where sophisticated local search algorithms have not been found. 

We thank C. Jayaprakash for critical reading of the manuscript and many helpful comments and suggestions. 
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Figure captions. 

Figure 1. Energy per site e vs. the CPU time t used for the optimization of a 20 x 20 systems. The fiUed-circle 
represents the energy obtained at the current time; while the filled square represents the lowest energy obtained up 
to time t. The data are taken every 800 iterations, and the optimization is performed with a = 0.1. 

Figure 2. The lowest energy obtained vs. the CPU time used, for the optimizations with a = 0.1,0.5,2.5 and 12.5. 
Local optimization with single spin flips are used in the algorithm. 

Figure 3. The lowest energy obtained vs. the CPU time used, for the optimizations with n = 0, n = 1, n = 9, and 
n = 100. The learning rate for these optimizations is chosen to be a = 2.5. The same 20 x 20 sample is used. 

Table captions. 

Table I. Average lowest energy, the CPU time, and the number of trials needed (number in parentheses) to reach 
the lowest energy configuration from the optimizations of the two dimensional systems of size L x L with L = 
5, 10, 20, 30, 40, 50, and 200. The values of the learning rate a used are 0.1, 0.5, 2.5, and 12.5. The number of samples 
Ng used is also listed (in parentheses under the system size). 

Table II. Average lowest energy, the CPU time, and number of trials needed (number in parentheses) to reach 
the lowest energy configuration from the optimizations of the three dimensional systems of size L x L x L with 
L = 4,6,8,10,12,14, and 25. The values of the learning rate a used are 0.1, 0.5, 2.5, and 12.5. The number of 
samples Ng used is also listed (in parentheses under the system size). 
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Table I 



Energy per site e, convergence time (seconds), and iteration steps 



L (iV,) 


a = 0.1 


a = 0.5 


a = 2.5 


a = 12.5 


5 

(320) 


-1.3405 ±0.0051 


-1.3405 ±0.0051 


-1.3405 ±0.0051 


-1.3405 ±0.0051 


0.0008 (1.0) 


0.0008 (1.0) 


0.0008 (1.0) 


0.0008 (1.0) 


10 

(IGO) 


-1.3882 ± 0.0037 


-1.3882 ±0.0037 


-1.3882 ±0.0037 


-1.3882 ± 0.0037 


0.005 (1.8) 


0.004 (1.7) 


0.0()5 (1.8) 


0.004 (1.7) 


20 
(80) 


-1.4019 ± 0.0022 


-1.4019 ±0.0022 


-1.4018 ±0.0022 


-1.4008 ± 0.0022 


0.74 (55) 


0.64 (48) 


0.46 (35) 


0.23 (20) 


30 

(40) 


-1.4007 ±0.0022 


-1.4003 ±0.0022 


-1.3999 ±0.0022 


-1.3984 ±0.0023 


185 (6112) 


51 (1747) 


11 (389) 


2.2 (82) 


40 
(20) 


-1.4001 ± 0.0024 


-1.4003 ±0.0026 


-1.3995 ±0.0026 


-1.3983 ± 0.0026 


1587 (28670) 


287 (5590) 


49 (955) 


11 (246) 


50 
(10) 


-1.4002 ±0.0030 


-1.3999 ±0.0030 


-1.3994 ±0.0030 


-1.3988 ±0.0030 


4503 (53038) 


2959 (9393) 


547(1883) 


150 (579) 


200 
(10) 








-1.3976 ±0.0005 








17259(16893) 
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Table II 



Energy per site e, convergence time (seconds), and iteration steps 





a = 0.1 


a = 0.5 


a = 2.5 


a = 12.5 


4 

(320) 


-1.7453 ±0.0067 


-1.7453 ±0.0067 


-1.7453 ±0.0067 


-1.7453 ±0.0067 


0.006 (1.2) 


0.0006 (1.2) 


0.0006 (1.2) 


0.0005 (1.2) 


6 

(KiO) 


-1.7720 ±0.0028 


-1.7721 ±0.0027 


-1.7721 ± 0.0027 


-1.7714 ± 0.0028 


0.094 (5.8) 


0.07.") (4.8) 


0.077 (.").()) 


0.070 (4.7) 


8 
(80) 


-1.7855 ±0.0029 


-1.7855 ±0.0029 


-1.7849 ± 0.0029 


-1.7827 ±0.0030 


10.8 (208) 


5.1 (109) 


2.2 (50) 


0.73 (18) 


10 

(40) 


-1.7816 ±0.0021 


-1.7811 ±0.0021 


-1.7803 ±0.0022 


-1.7771 ±0.0021 


263 (3082) 


62.9 (765) 


18.7 (245) 


5.3 (71) 


12 
(20) 


-1.7816 ±0.0020 


-1.7813 ±0.0020 


-1.7800 ± 0.0024 


-1.7780 ± 0.0024 


1253 (9027) 


409 (3208) 


76.9 (609) 


27.9 (236) 


14 
(10) 


-1.7874 ±0.0020 


-1.7865 ±0.0020 


-1.7871 ± 0.0023 


-1.7836 ±0.0017 


4305 (20857) 


1270 (6335) 


251 (1312) 


52.6 (261) 


25 
(10) 








-1.7815 ±0.0009 








10080(11259) 
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